Skip to content

[improve] [test] Add a test to verify no orphan consumers if release lock caused by checkConnectionLiveness #21498

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Nov 1, 2023

Motivation & Modifications

Add a new test used to verify that the Broker will not leave an orphan consumer in the scenario below:

  • Register "consumer-1"
    • "consumer-1" will be maintained by the Subscription.
    • "consumer-1" will be maintained by the Dispatcher.
  • The connection of "consumer-1" has something wrong. We call this connection "connection-1"
  • Try to register "consumer-2"
    • "consumer-2" will be maintained by the Subscription. At this time, there are two consumers under this
      subscription.
    • This will trigger a connection check task for connection-1, we call this task "CheckConnectionLiveness".
      This task will be executed in another thread, which means it will release the lock Synchronized(dispatcher)
    • "consumer-2" was not maintained by the Dispatcher yet.
  • "CheckConnectionLiveness" will kick out "consumer-1" after 5 seconds, then "consumer-2" will be maintained
    by the Dispatcher.

(Highlight) Race condition: if the connection of "consumer-2" went to a wrong state before step 4,
"consumer-2" is maintained by the Subscription and not maintained by the Dispatcher. Would the scenario below
will happen?

  • "connection-2" closed.
  • Remove "consumer-2" from the Subscription.
  • Try to remove "consumer-2" from the Dispatcher, but there are no consumers under this Dispatcher. To remove
    nothing.
  • "CheckConnectionLiveness" is finished; put "consumer-2" into the Dispatcher.
  • At this moment, the consumer's state of Subscription and Dispatcher are not consistent. There is an orphan
    consumer under the Dispatcher.

Why there is no orphan consumer?

There are two mechanisms to avoid orphan consumers:

PersistentTopic.java

if (!cnx.isActive()) {
    try {
        consumer.close();
    } 
    ...
}

ServerCnx.java

if (consumerFuture.complete(consumer)) {
    ...
} else {
    consumer.close();
}

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Nov 1, 2023
@poorbarcode poorbarcode changed the title [improve] [test] Add a test to verify no orphan consumers if a race condition occur [improve] [test] Add a test to verify no orphan consumers if release lock caused by checkConnectionLiveness Nov 1, 2023
@poorbarcode poorbarcode self-assigned this Nov 1, 2023
@poorbarcode poorbarcode modified the milestones: 3.2.0, 3.0 Nov 1, 2023
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great initiative to add a test for the described scenario. I provided some review feedback mainly about the new test scaffolding.

import org.apache.pulsar.common.api.proto.CommandWatchTopicUpdate;
import org.apache.pulsar.common.util.netty.EventLoopUtil;

public class InjectedClientCnxClientBuilder {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea to have a separate class for this.

ClientConfigurationData conf = clientBuilder.getClientConfigurationData();
ThreadFactory threadFactory = new ExecutorProvider
.ExtendedThreadFactory("pulsar-client-io", Thread.currentThread().isDaemon());
EventLoopGroup eventLoopGroup =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would currently leak threads since it doesn't get closed. One possibility would be to override the closeAsync and shutdown methods of PulsarClientImpl like it was done in #21468 for PulsarTestClient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, I created a ulsarTestClient directly.

Comment on lines 95 to 102
new ClientCnxCustomizer(Type.PING){
@Override
public void handleCommand(Object command) {
if (command instanceof CommandPing) {
// do not response anything.
}
}
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ClientCnxCustomizer any easier than providing an actual anonymous ClientCnx class?

Passing a BiFunction<ClientConfigurationData, EventLoopGroup, ClientCnx> as a parameter in the InjectedClientCnxClientBuilder should be sufficient to support anything that is needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your suggestion is better.

I created a new implementation here, after the PR merged, I will back to finish this one.

Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

return new PulsarClientImpl(conf, eventLoopGroup, pool);
}

public static abstract class ClientCnxCustomizer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a reason to have ClientCnxCustomizer and InjectedClientCnx. Simply providing a BiFunction<ClientConfigurationData, EventLoopGroup, ClientCnx> parameter for creating the ClientCnx instance should be sufficient.
For example, you'd use it somewhat like this:

.clientCnxFunction((conf, eventLoopGroup) -> {
    return new ClientCnx(conf, eventLoopGroup) {
         @Override
        protected void handlePing(CommandPing ping) {
            // do not respond to CommandPing
        }
       };
});

The PulsarTestClient shows how it could be implemented. PulsarTestClient is specialized for a certain test, but it contains some useful ideas how clientCnxFunction could be used to instantiate the ClientCnx when it's needed.

Comment on lines +109 to +106
// Wait for all commands of the consumer c1 have been handled. To avoid the Broker mark the connection is active
// after it receive anything.
Thread.sleep(1000);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how do we know that 1000ms is sufficient?

@poorbarcode poorbarcode force-pushed the improve/test_orphan_consumer branch from b9a97a7 to 568431c Compare December 19, 2023 08:26
@Technoboy- Technoboy- modified the milestones: 3.2.0, 3.3.0 Dec 22, 2023
@coderzc coderzc modified the milestones: 3.3.0, 3.4.0 May 8, 2024
@lhotari lhotari modified the milestones: 4.0.0, 4.1.0 Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
doc-not-needed Your PR changes do not impact docs release/3.0.12
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants